Fast evaluation of union-intersection expressions 

Philip Bille* Anna Pagh* Rasmus Pagh* 



Abstract 

We show how to represent sets in a linear space data structure such 
that expressions involving unions and intersections of sets can be com- 
puted in a worst-case efficient way. This problem has applications in 
e.g. information retrieval and database systems. We mainly consider 
the RAM model of computation, and sets of machine words, but also 
state our results in the I/O model. On a RAM with word size w, a 
special case of our result is that the intersection of m (preprocessed) 
sets, containing n elements in total, can be computed in expected time 
$O(n(\log w)^2/w + km)$, where $k$ is the number of elements in the
intersection. If the first of the two terms dominates, this is a factor
$w^{1-o(1)}$ faster than the standard solution of merging sorted lists. We
show a cell probe lower bound of time $\Omega(n/(wm \log m) + (1 - \frac{\log k}{w})k)$,
meaning that our upper bound is nearly optimal for small m. Our 
algorithm uses a novel combination of approximate set representations 
and word-level parallelism. 

1 Introduction 

Algorithms and data structures for sets play an important role in computer 
science. For example, the relational data model, which has been the domi- 
nant database paradigm for decades, is based on set representation and ma- 
nipulation. Set operations also arise naturally in connection with database 
queries that can be expressed as a boolean combination of simpler queries. 
For example, search engines report documents that are present in the inter- 
section of several sets of documents, each corresponding to a word in the 
query. If we fix the set of documents to be searched, it is possible to spend 
time on preprocessing all sets, to decrease the time for answering queries. 

The search engine application has been the main motivation in several 
recent works on computing set intersections [4, 11, 12]. All these papers as- 
sume that elements are taken from an ordered set, and are accessed through 
comparisons. In particular, creating the canonical representation, a sorted 
list, is the best possible preprocessing in this context. The comparison-based 
model rules out some algorithms that are very efficient, both in theory and 
practice. For example, if the preprocessing produces a hashing-based
dictionary for each set, the intersection of two sets $S_1$ and $S_2$ can be
computed in expected time $O(\min(|S_1|, |S_2|))$. This is a factor
$\Theta(\log(1 + \max(\frac{|S_1|}{|S_2|}, \frac{|S_2|}{|S_1|})))$
faster than the best possible worst-case performance of comparison-based 
algorithms. 

In this paper we investigate non-comparison-based techniques for evalu- 
ating expressions involving unions and intersections of sets on a RAM. (In 
the search engine application this corresponds to expressions using AND 
and OR operators.) Specifically, we consider the situation in which each 
set is required to be represented in a linear space data structure, and pro- 
pose the multi-resolution set representation, which is suitable for efficient 
set operations. We show that it is possible in many cases to achieve running 
time that is sub-linear in the total size of the input sets and intermediate 
results of the expression. For example, we can compute the intersection of 
a number of sets in a time bound that is sub-linear in the total size of the 
sets, plus time proportional to the total number of input elements in the 
intersection. In contrast, all previous algorithms that we are aware of take 
at least linear time in the worst case over all possible input sets, even if the 
output is the empty set. The time complexity of our algorithm improves as 
the word size w of the RAM grows. While the typical word size of a modern 
CPU is 64 bits, modern CPU design is superscalar, meaning that several
independent instructions can be executed in parallel. This means that in 
most cases (with the notable exception of multiplication) it is possible to 
simulate operations on larger word sizes with the same (or nearly the same) 
speed as operations on single words. We expect that word-level parallelism 
may gain in importance, as a way of making use of the increasing parallelism 
of modern processor architectures. 



1.1 Related work 

1.1.1 Set union and intersection 

The problem of computing intersections and unions (as well as differences) 
of sorted sets was recently considered in a number of papers (e.g. [4, 12]) in 
an adaptive setting. A good adaptive algorithm uses a number of compar- 
isons that is close (or as close as possible) to the size of the smallest set of 
comparisons that determine the result. In the case of two sorted sets, this 
is the number of interleavings when merging the sets. In the worst case this 
number is linear in the size of the sets, in which case the adaptive algorithm 
performs no better than standard merging. However, adaptive algorithms 
are able to exploit "easy" cases to achieve smaller running time. Mirza- 
zadeh in his thesis [15] extended this line of work to arbitrary expressions 
with unions and intersections. These results are incomparable to those ob- 
tained in this paper: Our algorithm is faster for most problem instances, but 
the adaptive algorithms are faster in certain cases. It is instructive to
consider the case of computing the intersection of two sets of size n where the
size of the intersection is relatively small. In this case, an optimal adaptive 
algorithm is faster than our algorithm only if the number of interleavings of 
the sorted lists (i.e., the number of sublists needed to form the sorted list of 
the union of the sets) is less than around n/w. 

Another idea that has been studied is, roughly speaking, to exploit
asymmetry. Hwang and Lin [13] show that merging two sorted lists $S_1$ and
$S_2$ requires $\Theta(|S_1| \log(1 + \frac{|S_2|}{|S_1|}))$ comparisons, for
$|S_1| < |S_2|$, in the worst case over all input lists. This is
significantly less than $O(|S_1| + |S_2|)$ if $|S_1| \ll |S_2|$.
This result was generalized to computation of general expressions involving 
unions and intersections of sets by Chiniforooshan et al. [11]. Given an ex- 
pression, and the sizes of the input sets, their algorithm uses a number of 
comparisons that is asymptotically equal to the minimum number of
comparisons required in the worst case over all such sets.$^1$ The bounds stated
in [11] do not involve the size of the output, meaning that they pessimisti- 
cally assume the output to be the largest possible, given the expression and 
the set sizes. In contrast, our bounds will be output sensitive, i.e., involve 
also the size of the result of the expression. We further compare our result 
to that of [11] in Section 1.2.

1.1.2 Approximate set representations 

There has been extensive previous work on approximate set representations, 
mainly motivated by applications in networking and distributed systems [6] . 
Much of this work builds upon the seminal paper on Bloom filters [5]. A 
Bloom filter for a set S is an approximate representation of S in the sense 
that for any $x \notin S$ the filter can be used to determine that
$x \notin S$ with probability close to 1. However, for an $\varepsilon$
fraction of elements not in $S$, called false positives, the Bloom filter is
consistent with a set that includes these elements. The advantage of
allowing some false positives, rather than storing $S$ exactly, is that the
space usage drops to around $O(n \log(1/\varepsilon))$ bits, practically
independent of the size of the universe of which $S$ is a subset. Two Bloom
filters for sets $S_1$ and $S_2$ can be combined to form a Bloom filter for
$S_1 \cap S_2$ (resp. $S_1 \cup S_2$) in a very simple way: by taking bitwise AND
(resp. OR) of the data structures. 
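
The bitwise combination of two Bloom filters can be sketched as follows. This is a minimal illustration and not part of the original paper: the `BloomFilter` class, the filter size `m_bits`, the number of hash functions `k`, and the SHA-256-based bit positions are all assumptions made for the example. Both filters must share the same size and hash functions for the combination to be meaningful.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; a Python int serves as the bit vector."""

    def __init__(self, m_bits=1024, k=3, bits=0):
        self.m_bits, self.k, self.bits = m_bits, k, bits

    def _positions(self, x):
        # k bit positions derived from SHA-256 (an illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m_bits

    def add(self, x):
        for p in self._positions(x):
            self.bits |= 1 << p

    def might_contain(self, x):
        # True for every inserted element; may also be True for others.
        return all((self.bits >> p) & 1 for p in self._positions(x))

    def union(self, other):
        # Bitwise OR gives a filter for the union.
        return BloomFilter(self.m_bits, self.k, self.bits | other.bits)

    def intersection(self, other):
        # Bitwise AND gives a filter that accepts every common element.
        return BloomFilter(self.m_bits, self.k, self.bits & other.bits)


def build(items):
    bf = BloomFilter()
    for x in items:
        bf.add(x)
    return bf
```

Neither combination can introduce false negatives: every element of the union (resp. intersection) still passes `might_contain` on the combined filter.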

Bloom filters have been used in connection with computation of rela- 
tional joins, which are essentially multiset intersections, in the I/O model 
of computation. The idea is to use a Bloom filter for the smaller set to effi- 
ciently find most elements of the larger set that are not in the intersection. 

1 After personal communication with the authors, we have had confirmed that the 
algorithm described in [11] is not optimal in certain cases. Specifically, it does not always 
compute the union of sets in the optimal bound. However, the authors have informed us 
that the algorithm can be slightly modified to remove this problem. 
If the Bloom filter can fit into internal memory, this is a highly efficient
procedure for reducing the amount of data that needs to be considered in 
the join. The algorithm presented in this paper also uses approximate set 
representations to eliminate elements that will not contribute to the result. 
However, using Bloom filters does not appear to yield an efficient solution, 
essentially because the information pertaining to a particular element of S 
is distributed across the data structure. This makes it hard to locate the set 
of input elements represented by a particular Bloom filter. Instead, we use 
the approximate set representation of Carter et al. [9] (see also [16]), which 
consists of storing, in a compact way, the image of the set under a universal 
hash function. 

1.2 Setup and results 

We consider fully parenthesized expressions with binary operators. That 
is, we have a rooted binary tree with input sets at the leaves and internal 
nodes corresponding to union and intersection operations. Given the sizes 
of all input sets, we may associate with any node v two numbers (notation 
from [11]): 

• $\psi(v)$ is the maximum possible number of elements in the subexpression
rooted at $v$. (It can be computed bottom-up by summing child values
at union nodes, and choosing the minimum child value at intersection
nodes.)

• $\psi^*(v)$ is the maximum possible number of elements in the
subexpression rooted at $v$ that can appear in the final result. This is the
minimum value of $\psi$ over the nodes on the path from $v$ to the root.

We denote by $V$ the set of nodes in the expression (internal as well as leaves),
and let $v_0$ denote the root node.
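
The two quantities defined above can be computed with a short recursive sketch. The nested-tuple encoding of expressions (`("leaf", size)`, `("and", left, right)`, `("or", left, right)`) and the function names are assumptions made for this illustration. For brevity the second function recomputes the first at every node, which a careful implementation would avoid by caching.

```python
def psi(node):
    """Maximum possible size of the subexpression rooted at node."""
    kind = node[0]
    if kind == "leaf":
        return node[1]
    left, right = psi(node[1]), psi(node[2])
    # Union: sizes add up. Intersection: bounded by the smaller child.
    return left + right if kind == "or" else min(left, right)


def psi_star(node, bound=float("inf"), out=None):
    """Return (node, value) pairs in preorder, where value is the minimum
    of psi over the path from the node up to the root."""
    if out is None:
        out = []
    value = min(bound, psi(node))
    out.append((node, value))
    if node[0] != "leaf":
        psi_star(node[1], value, out)
        psi_star(node[2], value, out)
    return out
```

For the expression (A ∩ B) ∪ C with |A| = 10, |B| = 3 and |C| = 7, the leaf A gets the second value 3 even though the first is 10, because at most 3 of its elements can survive the intersection with B.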

Theorem 1 Given suitably preprocessed sets of total size n, we can compute 
the value of an expression with binary union and intersection operators in 
expected time $O(k' + \sum_{v \in V} \lceil \frac{\psi^*(v)}{w} \log^2(\frac{nw}{\psi^*(v)}) \rceil)$, where $k'$ is the number of
occurrences in the input of elements in the result. Preprocessing of a set of
size $n_1$ uses linear space and expected time $O(n_1 \log w)$.

Theorem 1 requires some effort to interpret. We will first state some
special cases of the result, and then discuss the general result towards the end
of the section. It is not hard to see that the terms in the sum of Theorem 1
corresponding to intersection nodes do not affect the asymptotic value. That 
is, we could alternatively sum over the set of leaf nodes and union nodes 
in the expression. In the case where the expression is an intersection of m 
sets we can further improve our algorithm and analysis to get the following 
result: 
Theorem 2 Given m preprocessed sets of total size n, we can compute the
intersection of the sets in expected time $O(n \log^2 w/w + km)$, where k is the
number of elements in the intersection. 

We show the following lower bound, implying that the time complexity 
of Theorem 2 is within a factor $(\log w)^2 m \log m$ of optimal, assuming
$w = (1 + \Omega(1)) \log n$. Our lower bound applies to the class of functions whose
union-intersection expression has an intersection operation on any root-to- 
leaf path (an element needs to be in at least two input sets to appear in 
the result). Note that if there is a path consisting of only union operations, 
there exists a set where all elements must be included in the result, so this 
requirement is no serious restriction. 

Theorem 3 Let f be a function of m sets given by a union-intersection 
expression with an intersection node on any root-to-leaf path. For integers 
$n$ and $k \le n/m$, any (randomized) algorithm in the cell probe model that
takes representations of sets $S_1, \dots, S_m \subseteq \{0,1\}^w$, where
$\sum_i |S_i| \le n$ and $|f(S_1, \dots, S_m)| \le k$, and computes
$|f(S_1, \dots, S_m)|$ must use expected time at least
$\Omega(n/(wm \log m) + (1 - \frac{\log k}{w})k)$ on a worst-case input. The lower
bound holds regardless of how the sets are represented. 

It is possible to (coarsely) bound the sum of Theorem 1 in terms of the
total size of the input sets and the $\cup$-depth of the expression (maximum
number of unions on a root-to-leaf path):

Corollary 1 Given m preprocessed sets of total size n, we can compute 
the value of an expression of $\cup$-depth $d$ with binary union and intersection
operators in expected time $O(m + k' + \frac{n}{w}(d + \log w)^2)$, where $k'$ is the number
of occurrences in the input of elements in the result. 

Possibly the best way of understanding the general result in Theorem 1
is to compare the complexity to the comparison-based algorithm of [11].
Though it might not result in the best running time for our algorithm, we
make the comparison in the case where any group of adjacent union
operators is arranged as a perfectly balanced tree in the expression tree (we
could modify our algorithm to always make this change to the expression). The
algorithm of [11] takes an expression where operators have unbounded degree,
and where union and intersection nodes alternate. It can be applied in our
setting by combining groups of adjacent union and intersection operators.
The time usage is at least $\Omega(k' + \sum_{v \in V} \psi^*(v))$ (in fact, the
complexity also involves a logarithmic factor on each term, but it is not
easily comparable to the factor in our result). Thus, if the word length is
sufficiently large, e.g. $w = (\log n)^{\omega(1)}$, our algorithm gains a
factor $w^{1-o(1)}$ compared to [11].

We observe that all our results immediately imply nontrivial results in 
the I/O model [1]. For the upper bounds, this is because any RAM algorithm 
can be simulated in the same I/O bound as long as w is bounded by the
number of bits in a disk block. In other words, if B is the number of words 
in a disk block, we can get I/O bounds by replacing w by Bw in the results. 
In fact, the power of 2 in the bounds can be reduced to 1 in this setting, as 
the I/O model does not count the cost of computation. Our lower bound 
also holds in the I/O model, with w replaced by Bw, independently of the 
size of internal memory. (The same proof applies.) 

1.3 Technical overview. 

Our results are obtained through non-trivial combination of several known 
techniques. We use the idea of Carter et al. [9] to obtain an approximate
representation of a set by storing a set h(S) of hash function values rather 
than the set S itself. Storing the approximation in a naive way (using at 
least log n bits per element) does not lead to a significant speedup in general. 
Instead, a compact representation of the set h(S) is needed. We use a
bucketed set representation, as in the dictionary of Brodnik and Munro [7], 
to get a compact representation of h(S) that is suitable for word-parallel 
set operations. Specifically, we show how set operations on small integers 
packed in words can be efficiently implemented, using ideas from [2,3]. This 
allows us to quickly approximate the intersection of any two sets in the sense 
that we get a compressed list of references to the elements in the intersection 
plus a small fraction of the elements not in the intersection. To compute the 
intersection we compute the intersection of the subsets of "candidates" in the 
standard way, using hashing. The generalization to the case of expressions 
involving arbitrary unions and intersections is an extension of this idea, 
using a variant of a technique from [11] to keep the sizes of the sets we have 
to deal with as small as possible. Our lower bound is shown by a reduction 
to multi-party communication complexity. 

2 Main algorithm and data structure 

In this section we present most of our algorithm and data structure,
postponing the material on word-level parallelism to Section 3 (which is used as
a black box in this section). Specifically, we show how to reduce the problem
of performing unions and intersections on sets of words to the problem of 
performing these operations on sets from a smaller universe. Due to space 
constraints, the time and space analysis is placed in Appendix A.

2.1 Overview of special case: Intersection 

We first present the main ideas in the case where the expression is an
intersection of $m$ sets. The basis of the approach is to map elements of $\{0,1\}^w$
to a smaller universe using a hash function $h$, and compute the intersection
$H = h(S_1) \cap \dots \cap h(S_m)$. Now, if $x \in S_1 \cap \dots \cap S_m$ then $h(x) \in H$. On the
other hand, if $x \notin S_1 \cap \dots \cap S_m$ then, if $h$ is suitably chosen, we will have
$h(x) \notin H$ with probability close to 1. Thus, we can regard $H$ as representing
a good approximation of $S_1 \cap \dots \cap S_m$. In particular, if we compute the sets
$S_i' = \{x \in S_i \mid h(x) \in H\}$, $i = 1, \dots, m$, we expect that $S_i'$ does not contain
many elements of $S_i \setminus (S_1 \cap \dots \cap S_m)$. Since $S_i \supseteq S_i' \supseteq S_1 \cap \dots \cap S_m$ we can
compute the intersection of $S_1, \dots, S_m$ as $S_1' \cap \dots \cap S_m'$, using a standard
linear time hashing-based algorithm. The challenge of this approach is to
keep the cost of computing $H$ and the sets $S_i'$ low. We store preprocessed,
compressed representations of the sets $h(S_i)$ using only $O(\log w)$ bits per
hash value, which allows us to compute $H$ in time that is sub-linear in the
size of the input sets. The elements of $S_i'$ are extracted in additional time
$O(|S_i'|)$. The details of these steps appear in Sections 2.3 and 3. Readers
mainly interested in the case of computing a single intersection may skip 
the description of the general case in the next subsection. 

2.2 The general case 

In the rest of the paper we let $f$ denote the function of $m$ input sets given
by the expression to be evaluated. Since $f(S_1, \dots, S_m)$ is monotone in the
sense that adding an element to an input set can never remove an element
from $f(S_1, \dots, S_m)$, we have that for any $x \in f(S_1, \dots, S_m)$ it holds that
$h(x) \in f(h(S_1), \dots, h(S_m))$. This means we can compute $f(S_1, \dots, S_m)$ by
the following steps:

1. Compute $H = f(h(S_1), \dots, h(S_m))$.

2. For all $i$ compute the set $S_i' = \{x \in S_i \mid h(x) \in H\}$.

3. Compute $f(S_1', \dots, S_m')$ to get the result.
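
The three steps can be sketched on ordinary Python sets. This shows only the logic of the reduction: it uses none of the compressed representations developed later, so it does not achieve the running times claimed in the paper. The expression encoding (an integer leaf index, or a tuple `("and" | "or", left, right)`) and the names `evaluate` and `three_step` are assumptions made for the example, and `h` may be any function on the elements.

```python
def evaluate(expr, leaves):
    """Evaluate a union-intersection expression over a list of sets.

    A leaf is an integer index into `leaves`; an internal node is a tuple
    ("and", left, right) or ("or", left, right).
    """
    if isinstance(expr, int):
        return leaves[expr]
    op, left, right = expr
    a, b = evaluate(left, leaves), evaluate(right, leaves)
    return a & b if op == "and" else a | b


def three_step(expr, sets, h):
    # Step 1: evaluate the expression on the hashed images.
    H = evaluate(expr, [{h(x) for x in s} for s in sets])
    # Step 2: keep only the elements whose hash value survived in H.
    reduced = [{x for x in s if h(x) in H} for s in sets]
    # Step 3: evaluate the expression exactly on the reduced sets.
    return evaluate(expr, reduced)
```

The result is exact for any choice of `h`, because an element of the output keeps its membership in every input set after step 2. The quality of `h` only affects how much the sets shrink.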

We will show how, starting with a suitable, compressed representation of 
the sets $h(S_1), \dots, h(S_m)$, we can efficiently perform the first two steps such
that the sets $S_i'$ are significantly smaller than the $S_i$ in the following sense:
Most of the elements that do not occur in $f(S_1, \dots, S_m)$ have been removed.
This means that, except for negligible terms, the time for performing the
third step, using the standard linear time hashing-based algorithm, depends
on the number of input elements in the output rather than on the size of the
input. Conceptually, the first step computes the expression on approximate
representations of the sets $S_1, \dots, S_m$. Then the information extracted from
this is used to create a smaller problem instance with the same result, which 
is then used to produce the answer. 

Assume for now that $h$ is given, and that we have access to data
structures for $h(S_1), \dots, h(S_m)$. The details on how to choose $h$ appear in
Section 2.3. The computation of $f(h(S_1), \dots, h(S_m))$ is done bottom-up in the
expression tree in the same order as the algorithm of [11]: For an
intersection node $v$ we first recursively process the child subtree whose root has the
smallest value of $\psi^*$; the children of union nodes are processed recursively
in arbitrary order. We adopt another idea of [11]: If the set computed for
the subtree rooted at $v$ has size more than $2\psi^*(v)$, we reduce the size of the
set to at most $\psi^*(v)$ by computing the intersection with the smallest child
set of an intersection node on the path from $v$ to the root. Observe that
this will only remove elements that are not in the output. Due to the way
we traverse the expression tree, the relevant child set will already have been
computed. For every node $v$ in the expression tree, we store the result $I_v$ of
the subexpression rooted at $v$.

For the root node $v_0$ define $I'_{v_0} = I_{v_0}$. To compute the sets $S_i'$ we
first traverse the tree top-down and compute for every non-root node $v$ the
intersection $I'_v = I_v \cap I'_{p(v)}$, where $p(v)$ is the parent node of $v$. Observe that,
by induction, $I'_v = I_v \cap f(h(S_1), \dots, h(S_m))$. We will see that the time for
this procedure is dominated by the time for computing $f(h(S_1), \dots, h(S_m))$.
At the end we have computed $h(S_i') = f(h(S_1), \dots, h(S_m)) \cap h(S_i)$ for all $i$.
All that remains is to find the corresponding elements of $S_i$, which is easily
done by looking up the hash function values in a hash table that stores $h(S_i)$
with the corresponding elements of $S_i$ as satellite information.

Finally, we compute $f(S_1', \dots, S_m')$ by first identifying all duplicate
elements in the sets (by inserting them in a common hash table), keeping track
of which set each element comes from. Then, for each element, we decide whether
it is in the output by evaluating the expression. This can be done in time
proportional to the number of occurrences of the element: First annotate 
each leaf and intersection node in the expression tree with the nearest an- 
cestor that is an intersection node. Then compute the set corresponding to 
each intersection node bottom-up. The time spent on an intersection node 
is bounded by the total size of the sets at intersection nodes immediately 
below it, but the intersection of these sets has size at most half of the total 
size. This implies the claimed time bound by a simple accounting argument. 

2.3 Data structure 

The best choice of h depends on the particular expression and size of input 
sets. For example, when computing the intersection $S_1 \cap S_2$ we want the
range of $h$ to have size significantly larger than the smaller set ($S_1$, say).
This will imply that most elements in $h(S_2 \setminus S_1)$ will not be in $h(S_1)$, and
there will be a significant reduction of the problem instance in step 2 of the 
main algorithm. On the other hand, the time and space usage grows with 
the size of the range of the hash function used, so it should be chosen no 
larger than necessary. In conclusion, to be able to choose the most suitable 
one in a given situation, we wish to store the image of every set under several 
hash functions, differing in the size of their range. The images of the set 
under various hash functions can be thought of as representations of the
set at different resolutions. Hence, we name our data structure the
multi-resolution set representation. As we show in Appendix A, it suffices to use
a hash function with range $\{0,1\}^r$, where $r = \log n + O(\log w)$ and $n$ is the
total size of the input sets. 

The hash functions will all be derived from a single "mother" hash
function $h^*$, a strongly universal hash function [8, 17] with values in the range
$\{0,1\}^w$. This is a global hash function that is shared for all sets. The
hash function $h_r$, for $1 \le r \le w$, is defined by $h_r(x) = h^*(x) \,\mathrm{div}\, 2^{w-r}$,
where "div" denotes integer division (we use the natural correspondence
between bit strings and nonnegative integers). Note that $h_r$ has function
values of $r$ bits. To store $h_r(S)$ for a particular set $S$, $r \ge \log |S| + 1$,
requires $|h_r(S)|(r - \log |h_r(S)| + \Theta(1))$ bits, by information theoretical
arguments. Since we may have $|h_r(S)| = |S|$ the space usage could be as high as
$|S|(r - \log |S| + \Theta(1))$. Note that the required space per element is constant
when $r \le \log |S| + O(1)$, and then grows linearly with $r$.
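
The derivation of every $h_r$ from one mother hash function amounts to a right shift, as the following sketch shows. The word size `W = 64`, the names `make_mother_hash` and `h_r`, and the multiply-add-shift function used for $h^*$ are assumptions made for the example; the latter is only a stand-in for the strongly universal family cited in the text.

```python
import random

W = 64  # word size w, an assumption for this example


def make_mother_hash(seed=0):
    """Return a function h* mapping integers to W-bit values.

    Assumption: multiply-add over 2W-bit arithmetic followed by a shift,
    used here as a stand-in for the hash family cited in the text.
    """
    rng = random.Random(seed)
    a = rng.getrandbits(2 * W)
    b = rng.getrandbits(2 * W)
    mask = (1 << (2 * W)) - 1
    return lambda x: ((a * x + b) & mask) >> W


def h_r(h_star, x, r):
    """h_r(x) = h*(x) div 2^(W - r): the r most significant bits of h*(x)."""
    return h_star(x) >> (W - r)
```

Because each $h_r$ is a prefix of $h^*$, a coarser value is obtained from a finer one by dropping low-order bits: `h_r(hs, x, r) == h_r(hs, x, r + 1) >> 1`.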

If we store $h_r(S)$ for all $r$, $\log |S| < r \le w$, the space usage may be $\Omega(w)$
times that of storing $S$ itself. To achieve linear space usage we store $h_r(S)$
only for selected values of $r$, depending on $|S|$, namely
$r \in \{\lceil \log |S| \rceil + 2^i \mid i = 0, 1, 2, \dots, \lfloor \log(w - \log |S|) \rfloor\}$.
These sets are stored using the bucketed set representation of Section 3,
which gives a space usage for $h_r(S)$ of $O(|S|(r - \log |S| + \log w))$ bits.
To get the representation of $h_r(S)$ for arbitrary $r$ we access the stored
representation of $h_{r'}(S)$, where $r' \ge r$, and throw away the $r' - r$ least
significant bits of its elements (see Section 3 for details). Choosing $r'$ as
small as possible minimizes the time for this step. We build the bucketed
set representation of the largest value of $r$ in $O(|S|)$ time by hashing, and
then apply Lemma 4 iteratively to get the structures
for the lower values of $r$.
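
As a rough illustration of which resolutions are stored and how an arbitrary resolution is derived from them, consider the sketch below. The function names, the default `w = 64`, and the use of plain Python lists instead of bucketed sets are assumptions made for the example.

```python
import math


def stored_resolutions(size, w=64):
    """Values r = ceil(log2 |S|) + 2^i for i = 0 .. floor(log2(w - log2 |S|))."""
    base = math.ceil(math.log2(size))
    top = math.floor(math.log2(w - math.log2(size)))
    return [base + 2 ** i for i in range(top + 1)]


def pick_resolution(size, r, w=64):
    """Smallest stored r' >= r, or None if no stored value is large enough."""
    return next((rp for rp in stored_resolutions(size, w) if rp >= r), None)


def coarsen(values, r_prime, r):
    """Turn h_{r'}(S) into h_r(S) by dropping the r' - r low-order bits."""
    return sorted({v >> (r_prime - r) for v in values})
```

For a set of 1000 elements and `w = 64` this stores six resolutions, `[11, 12, 14, 18, 26, 42]`. The gaps between consecutive values double, so the bits stored per element form a geometric sum.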

The final thing we need is a hash table that allows us to look up a value
$h_r(x)$ and retrieve the element(s) in $S$ that have this value of $h_r$. This can
be done by using the $\lceil \log |S| \rceil$ most significant bits of $h_r$ as index into a
chained hash table. Since the values of these bits are common for all $h_r$,
$\log |S| < r \le w$, we only need to store a single hash table. Note that the
size of the hash table is $\Theta(|S|)$, which means that the expected lookup time
is constant. 

3 Bucketed and packed sets 

We describe two representations of sets of elements from a small universe 
and provide efficient algorithms for computing union and intersection in 
the representations. Proofs of the lemmas in this section can be found in 
Appendix B.
3.1 Packed sets



Given a parameter $f$ we partition words into $k = w/(f + 1)$ substrings,
called fields, numbered from right to left. The most significant bit of a field
is called the test bit and the remaining $f$ bits are called the entry. A word
is viewed as an array $A$ capable of holding up to $k$ bit strings of length $f$.
If the $i$th test bit is 1 we consider the $i$th field to be vacant. Otherwise
the field is occupied and the bit string in the $i$th entry is interpreted as the
binary encoding of a non-negative integer. If $|A| > k$ we can represent it in
$\lceil |A|/k \rceil$ words, each storing up to $k$ elements. We call an array represented
in this way a packed array with parameter $f$ (or simply packed array if $f$
is understood from the context). For our purposes we will always assume
that fields are capable of storing the total number of fields in a word, that is,
$f \ge \log k$. In the following we present a number of useful ways to manipulate
packed arrays.
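
The field layout of a single word can be made concrete with a small sketch. The function names, the default `w = 64`, and the 0-based field numbering are assumptions made for the example; Python integers stand in for machine words.

```python
def pack_word(values, f, w=64):
    """Pack up to k = w // (f + 1) integers of f bits into one word.

    Field i occupies bits [i*(f+1), (i+1)*(f+1)), counted from the right.
    Its top bit is the test bit: 1 means vacant, 0 means occupied.
    """
    k = w // (f + 1)
    assert len(values) <= k and all(0 <= v < (1 << f) for v in values)
    word = 0
    for i in range(k):
        if i < len(values):
            field = values[i]      # test bit 0, entry holds the value
        else:
            field = 1 << f         # test bit 1 marks a vacant field
        word |= field << (i * (f + 1))
    return word


def unpack_word(word, f, w=64):
    """Return the entries of all occupied fields, in field order."""
    k = w // (f + 1)
    out = []
    for i in range(k):
        field = (word >> (i * (f + 1))) & ((1 << (f + 1)) - 1)
        if not field >> f:         # test bit 0: the field is occupied
            out.append(field & ((1 << f) - 1))
    return out
```

With `f = 7` and `w = 64` a word holds eight fields, and an empty array is the word in which only the eight test bits are set.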

Suppose A is a packed array containing x occupied fields. Then, com- 
pacting A means moving all the occupied fields into the first x fields of A 
while maintaining the order among them. 

Lemma 1 (Andersson et al. [3]) A packed array $A$ with parameter $f$ can
be compacted in $O(|A| \lceil f^2/w \rceil)$ time.

Let $X = x_1, \dots, x_m$ be a sequence of $f$-bit integers. If $X$ is given as a
packed array with parameter $f$, such that the $i$th field, $1 \le i \le m$, holds $x_i$,
we say that $X$ is a packed sequence with parameter $f$. We use the following
result:

Lemma 2 (Albers and Hagerup [2]) Two sorted packed sequences $X_1$
and $X_2$ with parameter $f$ can be merged into a single sorted packed sequence
in $O((|X_1| + |X_2|) \lceil f/w \rceil)$ time.

We refer to a sorted, packed sequence of integers as a packed set.

Lemma 3 Given packed sets $S_1$ and $S_2$ with parameter $f$, the packed sets
$S_1 \cup S_2$ and $S_1 \cap S_2$ with parameter $f$ can be computed in
$O((|S_1| + |S_2|) \lceil f/w \rceil)$ time.
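
For reference, the result that Lemma 3 computes can be specified by an ordinary element-by-element merge, sketched below. This is not the word-parallel routine behind the lemma, which processes many fields per word operation; it only pins down the expected output. The function name and the use of sorted Python lists are assumptions made for the example.

```python
def merge_union_intersection(s1, s2):
    """Union and intersection of two sorted lists of distinct integers,
    computed in a single linear merge pass."""
    union, inter = [], []
    i = j = 0
    while i < len(s1) and j < len(s2):
        if s1[i] == s2[j]:
            # Common element: goes to both outputs, advance both inputs.
            union.append(s1[i])
            inter.append(s1[i])
            i += 1
            j += 1
        elif s1[i] < s2[j]:
            union.append(s1[i])
            i += 1
        else:
            union.append(s2[j])
            j += 1
    # At most one of the two tails is non-empty.
    union.extend(s1[i:])
    union.extend(s2[j:])
    return union, inter
```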

3.2 Bucketed sets 

Let $S$ be a set of $l$-bit integers. For a given parameter $b < l$ we partition
$S$ into $2^b$ subsets, $S_0, \dots, S_{2^b - 1}$, called buckets. Bucket $S_i$ contains all values
in the range $[i \cdot 2^{l-b}, (i+1) \cdot 2^{l-b} - 1]$, and therefore all values in $S_i$ agree on
the $b$ most significant bits. Hence, to represent $S_i$ it suffices to know the $b$
most significant bits together with the set of the $l - b$ least significant bits.
We can therefore compactly represent $S$ by an array of length $2^b$, where
the $i$th entry points to the packed set (with parameter $l - b$) of the $l - b$
least significant bits of $S_i$. We say that $S$ is a bucketed set with parameter
$b$ if it is given in this representation. Note that such an encoding of $S$ uses
$O(2^b w + |S|(l - b))$ bits. As above, we assume that fields in packed sets are
capable of holding the number of fields in a word, that is, we assume that
$l - b \ge \log w - \log(l - b + 1)$ in any bucketed set. We need the following
results to manipulate bucketed sets.
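
The bucketed representation can be sketched as follows, with plain sorted lists standing in for the packed sets that hold each bucket's suffixes. The function names are assumptions made for the example, and the bucket-wise intersection uses Python set operations rather than the word-parallel routines of the paper.

```python
def to_bucketed(values, l, b):
    """Split l-bit integers into 2^b buckets by their b most significant bits.

    Each bucket holds the sorted (l - b)-bit suffixes of its values.
    """
    buckets = [[] for _ in range(1 << b)]
    low_mask = (1 << (l - b)) - 1
    for v in sorted(set(values)):
        buckets[v >> (l - b)].append(v & low_mask)
    return buckets


def from_bucketed(buckets, l, b):
    """Recover the sorted original values from the bucket array."""
    return [(i << (l - b)) | low
            for i, bucket in enumerate(buckets) for low in bucket]


def bucketed_intersection(b1, b2):
    """Intersect bucket by bucket; only equal-index buckets can share values."""
    return [sorted(set(x) & set(y)) for x, y in zip(b1, b2)]
```

Since the bucket index already encodes the top $b$ bits, each stored suffix needs only $l - b$ bits, which is where the space saving comes from.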

Lemma 4 Let $S$ be a bucketed set of $l$-bit integers with parameter $b$. Then,

1. Given an integer $b'$ we can convert $S$ into a bucketed set with parameter
$b'$ in time $O(2^{\max(b, b')} + |S| \lceil (l - \min(b, b'))^2/w \rceil)$.

2. Given an integer $b < x < l$ we can compute the bucketed set
$S' = \{j \,\mathrm{div}\, 2^x \mid j \in S\}$ of $(l - x)$-bit integers with parameter $b$ in
$O(2^b + |S| \lceil (l - b)^2/w \rceil)$ time.

Let $S$ be a bucketed set of $l$-bit integers with parameter $b$. We say
that $S$ is a balanced bucketed set if $b$ is the largest integer such that
$b \le \log |S| - \log w$. Intuitively, this choice of $b$ balances the space for the array
of buckets and the packed sets representing the buckets. Since $l \ge \log |S|$
the condition implies that $l - b \ge l - \log |S| + \log w \ge \log w - \log(l - b + 1)$.
Hence, the field length of the packed sets representing the buckets in $S$ is
as required. Also, note that the space for a balanced bucketed set $S$ is
$O(2^b w + |S|(l - b)) = O(|S|(l - \log |S| + \log w))$.
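
A small numeric check of the balanced choice of $b$ is sketched below. The function names and the default `w = 64` are assumptions made for the example; clamping $b$ at 0 for sets smaller than a word is also an assumption, since the text does not discuss that case.

```python
import math


def balanced_parameter(size, w=64):
    """Largest integer b with b <= log2(size) - log2(w), clamped at 0."""
    return max(0, math.floor(math.log2(size) - math.log2(w)))


def space_bits(size, l, b, w=64):
    """The two terms of the space bound: bucket array, then packed suffixes."""
    return (1 << b) * w, size * (l - b)
```

For $|S| = 2^{20}$, $l = 32$ and $w = 64$ this gives $b = 14$: the bucket array takes $2^{14} \cdot 64 = 2^{20}$ bits, about one bit per element, while the suffixes take 18 bits per element.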

Lemma 5 Let $S_1$ and $S_2$ be balanced bucketed sets of $l$-bit integers. The
balanced bucketed sets $S_1 \cup S_2$ and $S_1 \cap S_2$ can be computed in time

$O((|S_1| + |S_2|) \lceil (l - \log(|S_1| + |S_2|) + \log w)^2/w \rceil)$.

If $l = \Theta(\log(|S_1| + |S_2|))$, Lemma 5 provides a speedup by a factor of
$w/\log^2 w$.

4 Lower bound 

In this section we show Theorem 3. The proof uses known bounds from
$t$-party communication complexity, where $t$ communicating players are
required to compute a function of $n$-bit strings $x_1, \dots, x_t$, where $x_i$ is held by
player $i$, using as little communication as possible. We consider the
blackboard model where a bit communicated by one player is seen by all other
players, and consider the following binary functions:

EQ$(x_1, x_2)$, which has value 1 iff $x_1 = x_2$. (Here $t = 2$.)

DISJ$_{n,t}(x_1, \dots, x_t)$, which has value 1 iff there is no position where two
bit strings $x_i$ and $x_j$ both have a 1 (i.e., all pairs are "disjoint"). We
consider this problem under the unique intersection assumption, where
either all pairs are disjoint, or there exists a single position where all
bit strings have a 1. We allow the protocol to behave in any way if
this is not the case.

Solving EQ exactly requires communication of $\Omega(n)$ bits, for both
deterministic and randomized protocols [14,18]. That is, the trivial protocol where
one player communicates her entire bit string is optimal. Chakrabarti et
al. [10] showed that solving DISJ$_{n,t}$ exactly requires $\Omega(n/(t \log t))$ bits of
communication in expectation, even under the unique intersection
assumption and when the protocol is randomized.

Our main observation is that if sets $S_1, \dots, S_t$ have been independently
preprocessed, we can view any algorithm that computes $f(S_1, \dots, S_t)$ as
a communication protocol where each player holds a set. Whenever the
algorithm accesses the representation of $S_i$ it corresponds to $w$ bits being
sent by player $i$. Formally, given any (possibly randomized) algorithm that
computes $|f(S_1, \dots, S_t)|$, where $S_1, \dots, S_t$ have been individually
preprocessed in an arbitrary way, we derive communication protocols for EQ and
DISJ$_{n,t}$, and use the lower bounds for these problems to conclude a lower
bound on the expected number of steps used by the algorithm. We note that
this reduction from communication complexity is different from the
reduction from asymmetric communication complexity commonly used to show
data structure lower bounds.

Let $n$ and $k$, $1 \le k \le n/t$, denote integers such that the
algorithm correctly computes $|f(S_1, \ldots, S_t)|$ provided that the
sum of the sizes of the sets is at most $n+1$, and that
$|f(S_1, \ldots, S_t)| \le k$. Let $r$ denote the expected number of cell
probes on a worst-case input of this form. Given vectors
$x_1, \ldots, x_t \in \{0,1\}^n$ satisfying the unique intersection
assumption, we consider the sets
$S_i = \{ j \mid x_i \text{ has a 1 in position } j \}$ and their
associated representations (which could be chosen in a randomized
fashion). Observe that the total size of the sets is at most $n+1$, and
that $|f(S_1, \ldots, S_t)| = 0$ if and only if $S_1, \ldots, S_t$ are
disjoint (using the assumptions on $f$). By simulating the algorithm on
these representations, we get a communication protocol for
$\mathrm{DISJ}_{n,t}$ using $rw$ bits in expectation, since each of the
$r$ cell probes reveals $w$ bits. By the lower bound on
$\mathrm{DISJ}_{n,t}$ we thus have $rw = \Omega(n/(t \log t))$ on a
worst-case input, i.e., $r = \Omega(n/(wt \log t))$ cell probes are
needed.
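
The encoding used in this reduction can be illustrated by a minimal
sketch. It assumes, purely for illustration, that $f$ is the plain
intersection of its arguments; the function names are hypothetical, and
the sketch ignores preprocessing and the counting of cell probes.

```python
def sets_from_vectors(vectors):
    """Map each bit vector x_i to the set S_i = {j | x_i has a 1 in position j}."""
    return [{j for j, bit in enumerate(x) if bit == 1} for x in vectors]


def disj_via_intersection(vectors):
    """Return 1 iff the encoded sets have an empty intersection.

    Under the unique intersection assumption this equals DISJ: either all
    pairs are disjoint, or a single position is shared by every vector.
    Assumption: f is taken to be the intersection of all sets.
    """
    sets = sets_from_vectors(vectors)
    return 1 if len(set.intersection(*sets)) == 0 else 0
```

For example, three pairwise disjoint vectors give the value 1, while
vectors that all share one position give the value 0.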

Consider the function $f'(S_1, S_2) = f(S_1, \ldots, S_1, S_2)$. Clearly,
a lower bound on the cost of computing $f'$ applies to $f$ as well. We
denote by $\binom{\{0,1\}^w}{k}$ the set of subsets of $\{0,1\}^w$ having
size $k$. Let $q = \lfloor \log_2 |\binom{\{0,1\}^w}{k}| \rfloor$, and let
$\phi$ be any injective function from $\{0,1\}^q$ to
$\binom{\{0,1\}^w}{k}$. Given two vectors $x, y \in \{0,1\}^q$ we
consider the sets $S_1 = \phi(x)$ and $S_2 = \phi(y)$, which satisfy
$|f'(S_1, S_2)| \le k$ and $(t-1)|S_1| + |S_2| \le n$. Since $\phi$ is
injective, we have that $x = y$ iff $|f'(S_1, S_2)| = k$. Thus, similar
to above, we get a communication protocol for EQ that uses $rw$ bits in
expectation on a worst-case input. By the lower bound on EQ we have
$r = \Omega(q/w)$, implying that $r = \Omega(k(w - \log_2 k)/w)$. The
maximum of our two lower bounds is a factor of at most two from the sum
stated in the theorem, finishing the proof.
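
The step from $q$ to $k(w - \log_2 k)$ rests on the standard estimate
$\binom{N}{k} \ge (N/k)^k$ with $N = 2^w$. The following sketch, with
hypothetical function names, computes $q$ exactly for small parameters so
that the two quantities can be compared numerically.

```python
from math import comb, log2


def eq_instance_bits(w, k):
    """q = floor(log2 C(2**w, k)), computed exactly with integer arithmetic."""
    return comb(2 ** w, k).bit_length() - 1


def lower_bound_estimate(w, k):
    """The quantity k * (w - log2 k) that q is compared against."""
    return k * (w - log2(k))
```

For instance, `eq_instance_bits(8, 1)` equals 8, matching
`lower_bound_estimate(8, 1)`.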

5 Conclusion and open problems 

We have shown how to use two algorithmic techniques, approximate set
representations and word-level parallelism, to accelerate algorithms for
basic set operations. Potentially, the results (or techniques) could have
a number of applications in problem domains such as databases
(relational, textual, ...) where some preprocessing time (indexing) may
be invested to keep the cost of queries low.

It is an interesting problem whether our results can be extended to handle 
non-monotone set operators such as set difference. The technical problem 
here is that one would have to deal with two-sided errors in the estimates 
of the intermediate results. 

Acknowledgement. We thank Mikkel Thorup for providing us with useful
insight on the use of word-level parallelism on modern processors.

A Analysis of our algorithm



Running time 

We now sketch the proof of Theorem 1 using results on the set representation
described in Section 3. Observe that the worst case for our algorithm occurs when
any intermediate result of a node $v$ has size $\psi(v)$, so we need only consider this
case. The intersection operations performed when the subresult of a node $v$ has size
at least $2\psi^*(v)$ can be regarded as "free", since at least half of the elements will
be removed. We can thus charge the cost of this against the cost of computing the
removed elements. It follows from Lemma 5 that the remaining parts of the first
two steps in the main algorithm use time
$O\bigl(\sum_{v \in V} \lceil \frac{\psi^*(v)}{w} \log^2 ( \frac{nw}{\psi^*(v)} ) \rceil\bigr)$.

The main part of the analysis is to bound the number of elements in $S'_1, \ldots, S'_m$
that are not part of the final result. We will show that in expectation there are
$O(n/w)$ such elements, implying that the time spent on these elements in the third
step of the main algorithm is negligible. This will finish the argument, as the time
spent in the third step on elements in the final result is captured by the $O(k)$ term.

Consider an element $x$ that is a member of one or more input sets, but not part
of the result of evaluating the expression. Then there exists some input element
$y \neq x$ such that $h(y) = h(x)$. It is a basic property of universal hash functions that
this happens with the same probability as if $h$ was a truly random function. By
our choice of $r$, the probability that this happens for any particular $x$ is $O(1/w)$.
This means that the expected number of such elements is $O(n/w)$, as desired.
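
The probability estimate can be spelled out as a union bound. The
following is a sketch under two assumptions that are not restated here:
that $h$ maps elements to $r$-bit values with collision probability at
most $2^{-r}$ for any fixed pair, and that $r$ is chosen so that
$2^r \ge nw$. Constant factors are ignored.

```latex
\Pr\bigl[\exists\, y \neq x : h(y) = h(x)\bigr]
  \;\le\; \sum_{y \neq x} \Pr\bigl[h(y) = h(x)\bigr]
  \;\le\; \frac{n}{2^{r}}
  \;\le\; \frac{1}{w}
  \qquad \text{whenever } 2^{r} \ge n w .
```

Summing this bound over the at most $n$ elements $x$ gives, by linearity
of expectation, at most $n/w$ such elements in expectation.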

In the case where we are computing an intersection, it suffices to choose
$r = \log(\min_i |S_i|) + O(\log w)$, implying the time bound stated in Theorem 2.

Preprocessing time and space 

The space for the bucketed sets is geometrically increasing with $i$, so it is dominated
by the largest structure, which uses $O(n_i)$ words, where $n_i$ is the number of elements
in the set. The hash table containing all elements of the set also uses $O(n_i)$ words.
The preprocessing time is dominated by the time for creating the $O(\log w)$ bucketed
sets, each in linear time by Lemma 4. The creation of the hash table takes expected
$O(n_i)$ time.

B Proofs for Section 3

Proof of Lemma 3. To compute $S_1 \cup S_2$, first merge the sorted sequences representing
$S_1$ and $S_2$ into a new sequence $X$. Then subtract from $X$ a copy of itself shifted 1 field
to the right. The $i$th field in the result stores the value 0 iff the $i$th field is a duplicate
value in $X$. Using this we set all such fields to vacant and compact the resulting
packed array, thus producing the packed set representation of $S_1 \cup S_2$. By Lemmas 1
and 2 and the fact that the subtraction can be done in $O(1)$ time for each
word, the result follows. For $S_1 \cap S_2$ we use the same algorithm, with the exception
that one of each pair of duplicate fields is set to occupied and all others are set to vacant. □
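
The merge, subtract and compact steps of this proof can be sketched on
ordinary Python lists. This is only an illustration of the logic: it
processes one field at a time and therefore does not reproduce the
word-level parallelism of packed sequences, and the function names are
hypothetical.

```python
from heapq import merge


def merged_with_dup_flags(s1, s2):
    """Merge two sorted duplicate-free lists and flag repeated fields.

    Field i is flagged iff X[i] - X[i-1] == 0, i.e. subtracting X shifted
    one field to the right yields 0 at position i.
    """
    x = list(merge(s1, s2))
    dup = [i > 0 and x[i] - x[i - 1] == 0 for i in range(len(x))]
    return x, dup


def union(s1, s2):
    """Vacate the flagged duplicates, then compact."""
    x, dup = merged_with_dup_flags(s1, s2)
    return [v for v, d in zip(x, dup) if not d]


def intersection(s1, s2):
    """Keep one field of each duplicate pair, vacate all others, then compact."""
    x, dup = merged_with_dup_flags(s1, s2)
    return [v for v, d in zip(x, dup) if d]
```

Since each input is a set, a value occurs at most twice in the merged
sequence, so keeping exactly the flagged fields yields each common
element once.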

Proof of Lemma 4. (1) Consider the case when $b' > b$. First, construct an array
of length $2^{b'}$ for the new buckets. Let $B$ be the packed set for a bucket in the
representation with parameter $b$. To compute the new representation we need to
repartition $B$ into $2^{b'-b}$ new buckets and convert each new bucket into a packed set
with parameter $l - b'$. For the repartitioning, observe that $B$ is a sorted packed
sequence and hence the new buckets are subsequences of $B$. It follows that we can
traverse and repartition $B$ in $O(2^{b'-b} + |B| \lceil (l-b)/w \rceil)$ time. Converting the packed
sets can be done by setting the $b'-b$ most significant bits of all fields in the packed
sequences to 0 and then compacting the fields as described in the previous section. In
total we use $O(2^{b'} + |S| \lceil (l-b)^2/w \rceil)$ time to convert $S$ into a bucketed set with
parameter $b'$. The case when $b' < b$ follows similarly, except that packed sets need
to be converted to the parameter $b'$ representation before they are repartitioned into
new buckets.

(2) Similar to the conversion of the buckets in the proof of (1), we mask out the
x least significant bits of each bucket and compact the fields. □
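
The repartitioning in part (1) can be illustrated with a small sketch on
unpacked lists. It assumes that a bucketed set with parameter $b$ groups
$l$-bit integers by their $b$ most significant bits and stores only the
remaining $l - b$ bits in each bucket; the function names are
hypothetical and no word-level packing is modelled.

```python
def to_bucketed(elements, l, b):
    """Partition l-bit integers by their b most significant bits.

    Each bucket stores the remaining l - b low-order bits, in sorted order.
    """
    low = l - b
    buckets = [[] for _ in range(1 << b)]
    for x in sorted(elements):
        buckets[x >> low].append(x & ((1 << low) - 1))
    return buckets


def refine(buckets, l, b, b_new):
    """Convert from parameter b to b_new > b.

    Every old bucket splits into 2**(b_new - b) new buckets; because the old
    bucket is sorted, each new bucket is a contiguous run of it.
    """
    assert b_new > b
    extra = b_new - b
    low_new = l - b_new
    out = [[] for _ in range(1 << b_new)]
    for i, bucket in enumerate(buckets):
        for v in bucket:
            j = (i << extra) | (v >> low_new)          # index of the new bucket
            out[j].append(v & ((1 << low_new) - 1))    # clear the b_new - b top bits
    return out
```

Refining a bucketed set in this way gives the same structure as bucketing
the original elements directly with the larger parameter.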

Proof of Lemma 5. First, convert $S_1$ and $S_2$ into bucketed sets with parameter
$b$ such that $b$ is the largest integer satisfying $b \le \log(|S_1| + |S_2|) - \log w$, using the
algorithm from Lemma 4. Next, perform the desired operation on each of the $2^b$
pairs of buckets using Lemma 3, producing a bucketed set $S$. Finally, convert $S$ (if
necessary) to a balanced bucketed set. □
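
A minimal sketch of this bucket-by-bucket strategy follows. It makes
several simplifying assumptions: buckets keep full element values rather
than only the low-order bits, Python sets stand in for the packed-set
operations of Lemma 3, $b$ is clamped to the range $[0, l]$, and the
final rebalancing step is omitted. All names are hypothetical.

```python
import operator


def bucketize(s, l, b):
    """Group l-bit integers by their b most significant bits (full values kept)."""
    buckets = [[] for _ in range(1 << b)]
    for x in sorted(s):
        buckets[x >> (l - b)].append(x)
    return buckets


def bucketed_op(s1, s2, l, w, op):
    """Apply a set operation pairwise on buckets with a common parameter b.

    b is the largest integer with 2**b <= (|s1| + |s2|) / w, clamped to [0, l].
    """
    total = len(s1) + len(s2)
    b = (total // w).bit_length() - 1 if total >= w else 0
    b = min(b, l)
    result = []
    for p, q in zip(bucketize(s1, l, b), bucketize(s2, l, b)):
        result.extend(sorted(op(set(p), set(q))))
    return b, result
```

Calling `bucketed_op(s1, s2, l, w, operator.and_)` computes an
intersection and `operator.or_` a union; because buckets are visited in
order of their common prefix, the concatenated result is sorted.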



C Improvement of algorithm for asymmetric intersections

In some situations we can substantially improve the performance of the algorithm
described above by transforming the problem as described in the
following. Consider a maximal subexpression that is an intersection of input sets.
We denote the number of input sets in the intersection by $m'$. We can improve
our algorithm if this intersection is asymmetric in the sense that the smallest of
the sets has $n'$ elements, where $n'm'$ is lower than the time needed to compute the
(approximate) intersection in step 1 of the algorithm in Section 2. Then we may
replace these sets by their intersection, looking up each element of the smallest
set in each of the other sets to determine if it belongs to the intersection. We
may create the balanced bucketed set and the hash table needed for the rest of
the algorithm in linear time. Thus, the time for this step is $O(n'm')$, potentially
reducing the running time.
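
The lookup-based replacement can be sketched as follows, with Python sets
standing in for the hash tables built during preprocessing. The function
name is hypothetical, and the sketch omits the subsequent construction of
the balanced bucketed set.

```python
def asymmetric_intersection(sets):
    """Intersect by probing every other set with each element of the smallest.

    With n' elements in the smallest set and m' sets in total, this performs
    at most n' * (m' - 1) membership lookups.
    """
    smallest = min(sets, key=len)
    others = [s for s in sets if s is not smallest]
    return {x for x in smallest if all(x in s for s in others)}
```

The number of lookups depends only on the smallest set and the number of
sets, not on the sizes of the larger sets.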